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Abstract 

Compressive sensing (CS) exploits sparsity to recover sparse or compressible signals from dimensional¬ 
ity reducing, non-adaptive sensing mechanisms. Sparsity is also used to enhance interpretability in machine 
learning and statistics applications: While the ambient dimension is vast in modern data analysis problems, 
the relevant information therein typically resides in a much lower dimensional space. However, many solu¬ 
tions proposed nowadays do not leverage the true underlying structure. Recent results in CS extend the simple 
sparsity idea to more sophisticated structured sparsity models, which describe the interdependency between the 
nonzero components of a signal, allowing to increase the interpretability of the results and lead to better recov¬ 
ery performance. In order to better understand the impact of structured sparsity, in this chapter we analyze the 
connections between the discrete models and their convex relaxations, highlighting their relative advantages. 
We start with the general group sparse model and then elaborate on two important special cases: the dispersive 
and the hierarchical models. For each, we present the models in their discrete nature, discuss how to solve the 
ensuing discrete problems and then describe convex relaxations. We also consider more general structures as 
defined by set functions and present their convex proxies. Further, we discuss efficient optimization solutions 
for structured sparsity problems and illustrate structured sparsity in action via three applications. 
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1 Introduction 


Information in many natural and man-made signals can be exactly represented or well approximated by a 
sparse set of nonzero coefficients in an appropriate basis (90j. Compressive sensing (CS) |24|40| 120| exploits this 
fact to recover signals from their compressive samples through dimensionality reducing, non-adaptive sensing 
mechanisms. From a different perspective, sparsity is used to enhance interpretability in machine learning 
[53 58] and statistics [ 123 [ applications: While the ambient dimension is vast in modern data analysis problems, 
the relevant information therein typically resides in a much lower dimensional space. This conclusion has 
lead to several new theoretical and algorithmic developments in different communities, including theoretical 
computer science [52], neuronal imaging (57], 741, gene expression inference [108,1211, bioinformatics [112,141] 
and X-ray crystalography (23|70|94| . In all these disciplines, given a data set to analyze, one is usually interested 
in the simplest model that well explains the observations. 

Apart from the rather mature theory describing the sparse signal approximation problem, its success lies 
also in the computational tractability of such task. In the CS regime, Donoho [ |40| and Candes et al. [24 ] utilize 
simple convex optimization—i.e., linear programming solvers or quadratic programming methods—and ma¬ 
trix isometry assumptions over sparsely restricted sets to compute a sparse approximation from a limited set of 


tentatively and in contrast to the conventional convex relaxation 
, 80,99] rely on the discrete structure of the sparse signal approxi¬ 
mation problem to find a solution and are characterized by identical approximation guarantees as in |24[|40). 


linear measurements in polynomial time. A 
approaches, iterative greedy algorithms [ 46} 


Moreover, their overall complexity matches, if not improves, the computational cost of convex approaches. 


1.1 Why structured sparsity? 

While many of the optimization solutions proposed nowadays result in sparse model selections, in many cases 
they do not capture the true underlying structure to best explain the observations J8|. In fact, this uninformed 
selection process has been the center of attention in sparse approximation theory |8lj[i38], because it not only 
prevents interpretability of results in many problems, but also fails to exploit key prior information that could 
radically improve estimation performance. 

Recent results in CS extend the simple sparsity idea to consider more sophisticated structured sparsity mod¬ 
els, which describe the interdependency between the nonzero coefficients, increase the interpretability of the 
results and lead to better recovery performance (8|l5p4pll [. Nowadays, we have witnessed aplenty elaborate 
approaches that guide the selection process: (overlapping) group Lasso, fused Lasso, greedy approaches for 
signal approximation under tree-structure assumptions just to name a few; see [47.73,124,137 [. To show the 
merits of such approaches, consider the problem of image recovery from a limited set of measurements using 
the tree-structured group sparse model \6\. Figure [T] shows the performance of structured sparsity models, 
compared to simple ones. 

Moreover, such a priori model-based assumptions result into more robust solutions and allow recovery with 
far fewer samples, e.g. 0(k) samples for k sparse signals whose non-zero coefficients are arranged into few 
blocks or form a rooted connected subtree of a given tree over the coefficients [8|. To highlight the importance 
of this property, in the case of Magnetic Resonance Imaging (MRI), reducing the total number of measurements 
is highly desirable for both capturing functional activities within small time periods and rendering the whole 
procedure less "painful" for the patient | 


2 Preliminaries 

Plain lowercase and uppercase letters represent scalars while boldface lowercase letters represent vectors. We 
reserve boldface uppercase letters for matrices. Calligraphic uppercase letters are mostly used to denote sets. 
The i -th entry of a vector w is denoted as w t . We use superscripts such as w 2 to denote the estimate at the i-th 
iteration of an algorithm. The sign of a scalar a G R is given by sign (a). 

Given a set 8 C 3Sf := {1,..., n}, the complement S c is defined with respect to N, and its cardinality as 
|S|. The support set of w is supp(w) = {i G N : Wi 0}. Given a vector w G M n , w§ is the projec¬ 
tion (in M n ) of w onto 8, i.e. (w§) §c = 0, whereas wj§ G lRl s l is w limited to 8 entries. We define £& := 
{w : w G M n , |supp(w)| < k} as the set of all k- sparse vectors in n-dimensions; with a slight abuse of nota¬ 
tion, we use £*; to further denote the set of k- sparse supports in N. We sometimes write x G to mean 
supp(x) G Efc. 
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Original Simple sparsity Structured sparsity 



PSNR: 20.43 dB PSNR: 23.56 dB 



PSNR: 28.43 dB PSNR: 33.24 dB 


Figure 1: Empirical performance of simple and structured sparsity recovery on natural 
images. In all cases, the number of linear measurements are 5% (top row) and 10% (bottom 
row) of the actual image dimensions. Left panel: Original images of dimension: (Top row) 
2048 x 2048, (Bottom row) 512 x 512. Middle panel: Conventional recovery using simple 
sparsity model. Right panel: Tree-structured sparse recovery. 


We use B n to represent the space of n-dimensional binary vectors and define l : W 1 ® n to be the indicator 
function of the nonzero components of a vector in M n , i.e., t(x)i = 1 if Xi ^ 0 and t(x)i = 0, otherwise. We let 
t n to be the n-dimensional vector of all ones, l n? § the n-dimensional vector of all ones projected onto S and I n 
the n x n identity matrix; we often use I when dimension is clear from the context. 

Norms: For completeness, we repeat some of the norm definitions presented in the Introduction chapter of this 
book. We define the f^-norm in n-dimensions as: 


IMIp = 


(EHiW p ) 1/P if P 6 (0, oo), 

maxi \xi\ if p = oc. 


The £q pseudo-norm is defined as: ||x||o := |supp(x)|. 

Projections operations: Given a discrete set with sparsity level k and an anchor point x e M n , a key 
operation in our subsequent discussions is the following projection problem: 

PmJx) e argmin {||w - x||| | supp(w) e M fc } . (1) 

wGM n 


Proximity operations: Consider q(-) : M n —M is a revularizer function. We define the proximity operator of 
<?(•) as (32} eq. (2.13)]: 


prox^(x) := argmin 

weM n 


1 

2 


w-x 


+ A-tf(w) 


(2) 


where A > 0 is the regularizer weight. 

Optimization preliminaries: Existing algorithmic solutions invariably rely on two structural assumptions on 
the objective function that particularly stand out among many others: the Lipschitz continuous gradient assump- 
tionthe strong regularity condition. 
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Definition 2.1 (Lipschitz gradient continuity) Let f : R n R be a convex, smooth differentiable function. Then, f is 
a smooth Lipschitz continuous gradient function if and only if for any v, w G dom(f): 

l|V/(v) - V/(w)|| 2 < i'llv - w|| 2 , 


for some global constant L > 0. 

Definition 2.2 (Strong regularity condition) Let f : W 1 M be a L-Lipschitz convex, doubly differentiable function. 
Then, f is strongly convex if and only if: 


/il < V 2 /(x) < LI, Vx e dom(f), 


for some global constant p > 0. 


3 Sparse group models 


We start our discussion with the group sparse models, i.e., models where groups of variables are either selected 
or discarded together [9 66 73,108 110 111 . These structures naturally arise in applications such as neuroimag¬ 
ing 157,74), gene expression data ]108||l21 , bioinformatics ]1121141| and computer vision )8}[26). For example. 


in cancer research, the groups might represent genetic pathways that constitute cellular processes. Identifying 
which processes lead to the development of a tumor can allow biologists to directly target certain groups of 
genes instead of others 11211. Incorrect identification of the active/inactive groups can thus have a rather dra¬ 
matic effect on the speed at which cancer therapies are developed. Figures |2p| illustr ate some more applications 
of group sparse models used in practice. 



Figure 2: Image segmentation 
application: the signal of inter¬ 
est includes human activity, ex¬ 
pressed in groups. 


Figure 3: Example of a mouse brain where 
brain regions are represented as groups of 
voxels 1861 (©2014 Allen Institute for Brain 
Science). 


Such group sparsity models—denoted as 0—feature collections of groups of variables that could overlap 
arbitrarily; that is 0 = {Si, • • •, Sm} where each S j is a subset of the index set N := {1,... ,n}. Arbitrary 
overlaps means that we do not restrict the intersection between any two sets S j and St from 0, j L 

We can represent a group structure 0 as a bipartite graph, where on one side we have the n variables nodes 
and on the other the M group nodes. An edge connects a variable node i to a group node j if i G Sj- The 
bi-adjacency matrix A G B nxM of the bipartite graph encodes the group structure. 


A f 1, if i e Sj; 

lJ {0, otherwise. 

Figure [5] shows an example. 

Another useful representation of a group structure is via a group graph (V, £) where the nodes V are the 
groups S G 0 and the edge set £ contains if Si n Sj f 0 , that is an edge connects two groups that overlap. A 
sequence of connected nodes v 2 , ..., v n , is a loop if v\ = v n . For an illustrative example see Figure[5] 

An important group structure is given by loopless pairwise overlapping groups because it leads to tractable 
projections j5 |. This group structure consists of groups such that each element of the ground set occurs in at 
most two groups and the induced graph does not contain loops. Therefore the group graph for these structures 
is actually a tree or a forest and hence bipartite; see Figure [6] 
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Figure 4: Consider the group structure (5 1 defined by the following groups. Si = {1}, 

s 2 = {2}, S 3 = {1, 2,3,4,5}, 34 = {4,6}, S 5 = {3,5,7} and Se = {6,7,8}. 



Figure 5: Bipartite group graph with loops induced by the group structure 6 1 described 
in Figure |4j where on each edge we report the elements of the intersection. 


Given the above and for a user-defined group budget G G Z +/ we define the group model Mg as Mg := 
{U Se eJ S £,J C 0, | J| < G}, that is all sets of indexes that are the union of at most G groups from the collection 
0. Then, the projection problem ([lj becomes: 

X =: fM G (x) S argmin {||x - z||| I supp(z) e M G } ■ (3) 

zGM n 

Moreover, one might be only interested in identifying the group-support of the approximation x, that is the G 
groups that constitute its support. We call this the group-sparse model selection problem. 

3.1 The discrete model 

According to we seach for xGl n such that ||3c — x||| is minimized, while x does not exceed a given group 
budget G. A useful notion in the group sparse model is that of the group £o-"norm": 

Auj > l( w) | ; (4) 

here, A denotes the adjacency matrix as defined in the previous subsection, the binary vector oo indicates which 
groups are active and the constraint Act? > i( w) makes sure that, for every non-zero component of w, there 
is at least one active group that covers it. Given the above definitions, the group-based signal approximation 
problem can be reformulated as 

x G argmin {||w — x||| | ||w|| 0> o < G} . (5) 

weR N 

One can easily observe that, in the case where we already know the group cover of the approximation x, 
we can obtain x as xj = xj and xjc = 0, where 3 = |Jges G (x) 3/ with S G (x) denoting the group support of x 
and 3 C = N \ 3. I.e., if we know the group support of the solution, the entries' values are naturally given by the 


w 


e,o := mm 

u;GB m 



S 2 S 2 


Si S 3 Si S 3 

Figure 6: Consider Si = {1,2,3}, S 2 = {3,4,5}, S 3 = {5,6, 7}, which can be represented 
by the graph in the left plot. If S 3 were to include an element from Si, for example {2}, we 
would have the loopy graph of the right plot. Note that 6 1 is pairwise overlapping, but 
not loopless, since S 3 , S 4 , S 5 and S6 form a loop. 
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anchor point x. The authors in |5j prove that the group support can be obtained by solving a discrete problem, 
according to the next lemma. 

Lemma 3.1 Q Given x e R N and a group structure 0, the group support of the solution x— denoted as S G (x) = 
{S j e 0 : oof = 1}— is given by the solution (u> G , y G ) of the following binary maximization problem: 

{ N M 

yixj : Au> > y, ujj < G 

i=1 3 =1 

Here, y denotes the selected variables, while u the active groups. Thus, the constraint Aca > y guarantees that, 
for every selected variable, there is at least one group that covers it. 

The problem in ([§]) can produce all the instances of the weighted maximum coverage problem (WMC) 1100], 
where the weights for each element are given by x\ (1 < i < n) and the index sets are given by the groups 
Sj E 0 (1 < j < M). Since WMC is in general NP-hard 11001 and given Lemma 1, finding the groups that cover 
the solution x is in general NP-hard. 

However, it is possible to approximate the solution of § using the greedy WMC algorithm 11011. At each 
iteration, the algorithm selects the group that covers new variables with maximum combined weight until G 
groups have been selected. To the best of our knowledge, only the work in [5] presents a polynomial time 
algorithm for solving © for loopless pairwise overlapping groups structures, using dynamic programming. 
The algorithm gradually explores the group graph, which in this case is a tree or a forest of trees, from the 
leaves towards the root, storing the best solutions found so far and dynamically updating them until the entire 
tree is examined. 



3.2 Convex approaches 

Recent works in compressive sensing and machine learning with group sparsity have mainly focused on lever¬ 
aging the group structures for lowering the number of samples required for recovering signals |8}|15J|44||67J|69 
108,111,1191. 


For the special case of non-overlapping groups, dubbed as the block-sparsity model, the problem of model 
selection does not present computational difficulties and features a well-understood theory 11191. The first 
convex relaxations for group-sparse approximation (137) considered only non-overlapping groups: the authors 
proposed the Group LARS (Least Angle Regression) algorithm to solve this problem, a natural extension 
of simple sparsity LARS algorithm (42) . Using the same algorithmic principles, its extension to overlapping 
groups 11401 has the drawback of selecting supports defined as the complement of a union of groups, even 
though it is possible to engineer the groups in order to favor certain sparsity patterns over others |73|. Eldar 
et al. [44] consider the union of subspaces framework and cast the model selection problem as a block-sparse 
model selection one by duplicating the variables that belong to overlaps between the groups, which is the 
optimization approach proposed also in |69|. Moreover, |[44j considers a model-based pursuit approach j3lj as 
potential solver for this problem, based on a predefined model . For these cases, one uses the group lasso 


norm 


Ei 

SG0 


c ls lip ’ 


(7) 


where X| g is the restriction of x to only the components indexed by S- 

In addition, convex proxies to the group ^o-norm @ have been proposed (e.g., |69]) for finding group-sparse 
approximations of signals. Given a group structure 0, an example generalization is defined as 


Il x ||0,{l,p} : ~ 


inf 

Vi,. .., V M G M n 
Vz,supp(vd = Sz 



M 


E v *= : 


( 8 ) 


i= 1 


where ||x|| p = is the f^-norm, and dj are positive weights that can be designed to favor certain 

groups over others [ 108[. This norm can be seen as a weighted instance of the atomic norm described in 
@[111], based on the seminal work of |3l) on the sparse synthesis model. There, the authors leverage convex 
optimization for signal recovery, but not for model selection: Let A := {ai, a 2 , • • • | Gl n , Vi} be the atomic 
set of group sparse signals x that can be synthesized as a k -sparse combination of atoms a i in A, i.e.. 


x = Qa^, Ci > 0, a.i E A and ||c|| 0 < k. 


i=1 


Here, a 4 e M n , supp(ai) = Si and ||a^ || 2 = 1. We can then define the atomic norm: 


ll x IU 


= inf < 


W \A\ 

i E Ci I x = E c * ai ’ Ci - °> Va * e ^ 

[ i =1 i=1 


(9) 


( 10 ) 
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Lemma 3.2 If in © the weights are all equal to 1 (di = 1, V i), we have 

ll x IU = ll x lk{i,p} • 

The group-norm © can also be viewed as the tightest convex relaxation of a particular set function related 
to the weighted set-cover (see Section[6]and |107|). 

One can in general use § to find a group-sparse approximation under the chosen group norm 

xe argmin {||w — x|| | : ||w|| 0){1>p} < A}, (11) 

where A > 0 controls the trade-off between approximation accuracy and group-sparsity. However, solving 
© does not necessarily yield a group-support for x: even though we can recover one through the decom¬ 
position {v- 7 } used to compute \\x\\®gi :P }, it may not be unique and when it is unique it may not capture the 
minimal group-cover of x 11081. Therefore, the equivalence of £o and minimization [24j[40]| in the standard 
compressive sensing setting does not generally hold in the overlapping group-based setting. 

The regularized version of problem fTT} is equivalent to the proximity operator of ||x||. Recently, (%} 
131| proposed an efficient algorithm for this proximity operator in large scale settings with extended overlap 
among groups. In this case, the proximity operator involves: (i) an active set preprocessing step 11351 that 
restricts the prox operations on a subset of the model—i.e., "active" groups and, (ii) a dual optimization step 
based on Bertsekas' projected Newton method |12|; however, its convergence requires the strong regularity 
of the Hessian of the objective near the optimal solution. The authors in [1] propose a fixed point method to 
compute the proximity operator for a wide range of group sparse model variants, given the model-structured 
mapping of the fixed-point Picard iterations is non-expansive. 11361 develops an efficient primal-dual criterion 
for the overlapping group sparse proximity operator: the solution of its dual formulation is used to compute 
the duality gap and tweak the accuracy of the solution and, thus, the convergence of the algorithm. 


3.3 Extensions 


In many applications, such as genome-wide association studies 11411 and multinomial classification 11331, it is 
desirable to find approximations that are not only group-sparse, but also sparse in the usual sense (see 11181 
for an extension of the group lasso). A classic illustrative example is the sparse group lasso approach m a 
regularization method that combines the lasso 11231 and the group lasso (93) . 

From a discrete pespective, the original problem © can be generalized by introducing a sparsity constraint 
k and allowing to individually select variables within a group. The generalized integer problem then becomes 


max 

B M , yGB 71 


( n 

n 

M 

{ ViX 2 i 

Au> > y, y] Vi < k, 

yyj < g > 


( 12 ) 




i= 1 


3 = 1 


This problem is in general NP-hard too, but it turns out that it can be solved in polynomial time for the 
same group structures that allow to solve ©• 

Proposition 1 [5] Given a loopless group structure 0, there exists dynamic programming algorithm that solves © 
with complexity linear in n and k. 


From a convex point of view, Simon et al. 11181 are the first to propose the sparse group model regularization 
in the context of linear regression. Given a group model 0 and constants Ai, A 2 > 0, they consider the problem 


minimize 

xGM n 


|/( x ) + Ai y^ ii x i 9 lb + A 2 |i x iii}- 


For this purpose, a generalized gradient descent algorithm is proposed. 11331 provides an efficient and robust 
sparse group lasso algorithm: in this case, a penalized quadratic approximation of the loss function is optimized 
via a Newton type algorithm in a coordinate descent framework 11361. 


4 Sparse dispersive models 

To describe the dispersive structure, we motivate our discussion with an application from neurobiology. Living 
beings behave and function via transmission of electrical signals between electrically excitable neuronal brain 
cells. Such chemical "information" causes a swift change in the electrical potential of a possibly discharged 
neuron cell, which results in its electrical excitation. Currently, we are far from understanding the grid of 
neurons in its entirety : large-scale brain models are difficult to handle while complex neuronal signal models 
lead to non-interpretable results. 
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Figure 7: Neuronal spike train example. 


Inspired by the statistical analysis in [511, the authors in [ 63 [ consider a simple one-dimensional model, 
where the neuronal signal behaves as a train of spike signals with some refractoriness period A > 0: there is a 
minimum nonzero time period A where a neuron remains inactive between two consecutive electrical excita¬ 
tions. In statistical terms, neuronal signals are defined by a inter-spike interval distribution that characterizes 
the probability a new spike to be generated as a function of the inter-arrival time. Figure [7] illustrates how a 
collection of noisy neuronal spike signals with A > 0 might appear in practice. 


4.1 The discrete model 

We provide next a formal definition of the sparse dispersive model: 

Definition 4.1 (Dispersive model) We define the dispersive model T) k in n-dimensions with sparsity level k and re¬ 
fractory parameter A G Z + as: 

Bfc = {§9 I Vg, S q C N, |S,| < hand \i - j\ > A,Vi S 9 } , (13) 

i.e., T) k is a collection of index subsets S q in N with number of elements no greater than k and with distance between the 
indices in S q greater than the interval A. 

We note that if there are no constraints on the interval of consecutive spikes, the dispersive model naturally 
boils down to the simple sparsity model E&. 

Given the definition above, the projection 0 is: 

J > D fe (x) G argmin {||w — x||| | supp(w) G D fe } . (14) 

wei n 

Let u G B n be a support indicator binary vector, i.e., u represents the support set of a sparse vector x such that 
supp(u;) = supp(x). Moreover, let D G ]g( n - A + 1 ) xn such that: 


D = 


"1 

0 


1 ••• 1 
1 1 •• 


1 0 0 
1 1 0 


0" 

0 


(15) 


0 


0 0 11 


1 


1 


(n—A+l) xn 


Here, per row, there are A consecutive ones that denote the time interval between two potential consecutive 
spikes. Finally, let b G M n-A+2 such that b := [k 1 1 • • • 1 l] T . 

According to [ 63 [, the following linear support constraints encodes the definition of the dispersive model 

D,: 


A := 


1 

D 


ca < b. 


One can observe that T) k = {Uvu;e3 SU PP ( u; ) | 3 := E : Acj < b}} . To this end, ( [T4] > becomes: 

^T)fc(x) G argmin {||w - x||| | A • supp(w) < b} . 

wGM n 

A key observation is given in the next lemma. 


(16) 


(17) 
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Lemma 4.2 [63] Given the problem setting above, it is easy to observe that © has solution T;D fc (x) such that 8 := 
supp (J , D fc (x)) and (Tx> fc (x)) s = where: 

8 G supp ( argmax {c T tu} ) , where c := \x\ x\ • • • x^] T , (18) 

\<*;EB n : Aw<b J 

i.e., we target to capture most of the signal's x energy, given structure T) k . To solve ( [18] }, the authors in [63] identify 
that the binary integer program ( [T8| is identical to the solution of the linear program, obtained by relaxing the integer 
constraints into continuous constraints. 


To this end. Lemma 4.2 indicates that fj 7\ can be efficiently performed using linear programming tools [201. 
Once ( p~7| is relaxed to a convex problem, decades of knowledge on convex analysis and optimization can be 
leveraged. Interior point methods find a solution with fixed precision in polynomial time but their complexity 
might be prohibitive even for moderate-sized problems. 


4.2 Convex approaches 


The constraint matrix D describes a collection of groups, where each group is assumed to have at most one non¬ 
zero entry to model the refractoriness propertyr] Moreover, these groups are overlapping which aggrandizes the 
"clash" between neighboring groups: a non-zero entry in a group discourages every other overlapping group 
to have a distinct non-zero entry. 

In mathematical terms, each row i of D defines a group Si such that Si = supp(d^) C N where d^ denotes 
the i -th row of D, Vi G {1,.. ., M := n — A + 1}: 


D = 


A 

I gi I 

I 62 


Gm 


Given such group structure, the dispersive model is characterized both by inter-group and intra-group proper¬ 
ties: 

• Intra-group sparity : we desire ||Dcu||oo < 1, i.e., per refractoriness period of length A, we require only one 
"active" spike. 

• Inter-group exclusion : due to the refractoriness property, the activation of a group implies the deactivation 
of its closely neighboring groups. 

While the sparsity level within a group can be easily "convexified" using standard ^i-norm regularization, 
the dispersive model further introduces the notion of inter-group exclusion, which is highly combinatorial. How¬ 
ever, one can relax it by introducing competitions among variables in overlapping groups: variables that have a 
"large" neighbor should be penalized more than variables with "smaller" neighbors. 

In this premise and based on [ 142 [, we identify the following family of norm^] 

^exclusive( x ) = ^ I "G \xj\ j , P = 2, 3, , (20) 

S i \ j G S i J 


as convex regularizers that imitate the dispersive model. In pO) , \xj\J := ||x s Ji promotes sparsity 

within each group Si, while the outer sum over groups ||xg. ||\ imposes sparsity over the number of groups 
that are activated. Observe that for p = 1, j2Q] } becomes the standard f?i-norm over N. Notice that the definition 
of the overlapping groups (instead of non-overlapping) is a key property for capturing the discrete structure: 
variables belonging to overlapping groups are weighted differently when considered parts of different groups. 
This leads to variable "suppression" (i.e., thresholding) of elements, depending on the "weight" of their neigh¬ 
borhood within the groups they belong to. 


1 Other convex structured models that can be described as the composition of a simple function over a linear transformation D can be 
found in | 1 |. 

2 The proposed norm originates from the composite absolute penalties (CAP) convex norm, proposed in jl401, according to which: 


9( x ) = L 

Si 



(19) 


for various values of 7 and p. Observe that this model also includes the famous group sparse model where <?(x) = ||xg. || 2 , described 
in Section[3] for p — 1/2 and 7 = 2 . 


9 














Figure 8: Wavelet coefficients naturally cluster along a rooted connected subtree of a reg¬ 
ular tree and tend to decay towards the leaves. (Left) Example of wavelet tree for a 32 x 
32 image. The root of the tree is the top-left corner and there are three regular subtrees re¬ 
lated to horizontal, vertical and diagonal details. Each node is connected to four children 
representing detail at a finer scale. (Centre) Grayscale 512 x 512 image. (Right) Wavelet 
coefficients for the image at center. Best viewed in color, dark blue represents values closer 
to zero. 



Figure 9: Hierarchical constraints. Each node represent a variable. (Left) A valid selection 
of nodes. (Right) An invalid selection of nodes. 


5 Hierarchical sparse models 


Hierarchical structures are found in many signals and applications. For example, the wavelet coefficients of 
images are naturally organized on regular quad-trees to reflect their multi-scale structure. Figure [8] and ||6}{8p3} 
61,. 67|90pl6 ,1401; gene networks are described by a hierarchical structure that can be leveraged for multi-task 
regression p7 1; hierarchies of latent variables are typically used for deep learning [111. 

In essence, a hierarchical structure defines an ordering of importance among the elements (either individual 
variables or groups of them) of a signal with the rule that an element can be selected only after its ancestors. 
Such structured models result into more robust solutions and allow recovery with far fewer samples. In com¬ 
pressive sensing, assuming that the signal possesses a hierarchical structure with sparsity k leads to improved 
sample complexity bounds of the order of 0(k) for dense measurement matrices [8 |, compared to the bound of 
0(k\og{n/k)) for standard sparsity. Also in the case of sparse measurement matrices, e.g. expanders, hierar¬ 
chical structures yield improved sample complexity bounds 


5.1 The discrete model 

The discrete model underlying the hierarchical structure is given by the next definition. 

Definition 5.1 (Hierarchical model) Let 7 denote an arbitrary tree or forest representation over the variables in a set 
N. We define a k rooted connected (RC) subtree S with respect to 7 as a collection of k variables in N such that v e S 
implies A(v) e S, where A{v) is the set that contains all the ancestors of the node v. 

The hierarchical model of budget k,7 k is the set of all k rooted-connected subtrees of 7. 

An example of rooted connected subtree over |K| = 9 variables is given in Figure [9] 

Given a tree 7, the rooted connected approximation is the solution of the following discrete problem 

3V fc (x) G argmin {||x - z\\% | supp(z) G T fc } . (21) 
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Figure 10: Examples of parent-child and family models. Active groups are indicated by 
dotted ellipses. The support (black nodes) is given by the union of the active groups. (Left) 
Parent-child model. (Right) Family model. 


which can be reformulated as follows 


( 22 ) 

where y is a binary vector with k non-zero components that indicates which components of x are selected. 
Given a solution y of the above problem, a solution z of pT] } is then obtained as Z| s = X| s and Z| = 0, where 
S = supp(y). 

This type of constraint can be represented by a group structure with an overall sparsity constraint k, where 
for each node in the tree we define a group consisting of that node and all its ancestors. When a group is 
selected, we require that all its elements are selected as well. Problem ( [22] } can then be cast as a special case of 
the Weighted Maximum Coverage problem ([§]}. Fortunately, this particular group structure leads to tractable 
solutions. 

Indeed ( [22] } can be solved exactly via a dynamic program that runs in polynomial time [|5p5|. For d -regular 
trees, that is trees for which each node has d children, the algorithm in |5| has complexity 0(nkd). 


( N 

y <= argmax l V y t x' 2 t : y <= 7 k 

y£B« 


5.2 Convex approaches 


The hierarchical structure can also be enforced by convex penalties, based on groups of variables. Given a tree 
structure T, define groups consisting of a node and all its descendants and let represent the set of all these 
groups. Based on this construction, the hierarchical group lasso penalty [75,77,140] imitates the hierarchical 
sparse model and is defined as follows 


^(x)hgl= w sll x |gllp (23) 

SG0T 

where p > 1, w$ are positive weights and X| S is the restriction of x to the elements contained in S- Since the 
nodes lower down in the tree appear in more groups than their ancestors, they will contribute more to 9(x) H gl 
and therefore will be more easily encouraged to be zero. The proximity operator of 9 H gl can be computed 
exactly for p = 2 and p = oc via an active set algorithm [75]. 

Other convex penalties have been recently proposed in order to favor hierarchical structures, but also al¬ 
lowing for a certain degree of flexibility in deviations from the discrete model. One approach considers groups 
consisting of all parent-child pairs and uses the latent group lasso penalty (see Section |T2| } in order to obtain 
solutions whose support is the union of few such pairs [110], see Figure [lO] (left). 

An interesting extension is given by the family model [13,1391, where the groups consist of a node and all 
its children, see Figure |l0] (right). Again the latent groups lasso penalty is used. This model is better suited for 
wavelet decomposition of images because it better reflects the fact that a large coefficient value implies large 
coefficients values for all its children at a finer scale. 

For both these cases, one can use the duplication strategy to transform the overlapping proximity problem 
into a block one, which can be efficiently solved in closed-form [69]. 


6 Submodular models 

Most of the structures described so far are naturally combinatorial, but the models presented in the previous 
sections are incapable of capturing more complex structures. A more general approach is to use a set function 
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R to quantify how much a given support set deviates from the desired structure. For instance, we can describe 
all previous structures by using an indicator function that assigns an infinite value to sets that do not belong to 
the chosen structure M k . Next, we provide a definition of such set functions and the sets that they describe: 

Definition 6.1 Given a set function R : 2^ M, we can define a model consisting of all sets S for which R(S) < r: 

:= {S | R(S) < r} . (24) 


Unfortunately, computing the projection of a given signal onto this set is not generally feasible for any set 
function, due to their intractable combinatorial nature. 

Therefore, it is necessary to restrict our attention to set functions with specific properties that lead to 
tractable problems. It turns out that in many applications (See for e.g, |2 3| 6j26 62.78,97,1151) the combina¬ 
torial penalties have a convenient "diminishing returns" property that we can exploit. These are the so-called 
submodular set functions. 


Definition 6.2 A set function R : 2^ R is submodular iff R{S U {v}) - R(S) > R(U U {v}) - R(U) / VS C U C N, 
and v e'N\U. 

Given a combinatorial penalty that encodes the desired structure, we want to solve an inverse problem of 
the form: 

minimize/(x) subject to supp(x) G !R r . (25) 

To simplify things, one can relax the hard constraint required in j25) , by using the combinatorial penalty as a 
regularizer to discourage unwanted patterns of the support: 

minimize /(x) + A • R (supp(x)), (26) 

x€l n 

for some regularization parameter A > 0. 


Unfortunately, the composite minimization of a continuous and discrete function required in |26| > leads to 
an NP-hard problem, even for submodular set functions. In fact, even in the simple sparsity case in CS setting, 
where /(x) := l||u - ^x|||, and R(S ) = |S|, the problem |26]) is NP- hard. 

When submodular functions are paired with continuous functions like in ( |26| ), the minimization becomes 
very difficult. However when considered alone, submodular functions can be efficiently minimized, see Sec¬ 
tion |7.1.2| Submodular function minimization constitutes a key component in both the convex and discrete 
approaches to solving |26] >. 


6.1 The discrete model 

In the discrete setting, ( [26} is preserved in an attempt to faithfully encode the discrete model at hand. While 
such criterion seems cumbersome to solve, it has favorable properties that lead to polynomial solvability, irre¬ 
spective of its combinatorial nature: there are efficient combinatorial algorithms that provide practical solutions 
to overcome this bottleneck and are guaranteed to converge in polynomial time ||43j; see Section [7d| 

Within this context, our intention here is to present fundamental algorithmic steps that reveal the underly¬ 
ing structure of ( |26} . Here, we assume that / has an L-Lipschitz continuous gradient and, thus, it admits the 
following upper bound: 

/(x) </(x l ) + (V/(x l ),x-x l ) + |||x-x*||| :=Q(x,x l ) Vx,x* e dom(/). 

Then, we perform the following iterative Majorization-Minimization (MM) scheme: 

x 2+1 G argmin {Q(x, x 2 ) + Ai?(supp(x))} = argmin < min Q(x, x 1 ) + A R(S) > (27) 

x£l n SCX [x:supp(x)=S J 

If we focus on in the inner minimization above and for a given support S, one can compute the optimal mini- 
mizer x l as: x| c = 0 and x\ = x| — (1/L) V/(x 2 )§. By substituting x\ into the upper bound Q(-, •), we obtain a 
modular majorizer M % \ 

/(x) + Ai?(supp(x)) < C - ^ ( x j - ^ v /(x*)i) + Ai?(supp(x)) 

j'Gsupp(x) V 

:= M 2 (supp(x)) + Aii(supp(x)), 

where C is a constant. Finally, the update ( [27] ) then becomes: 

x 2+1 = x| subject to S G arg min M l (S) + R(S) (28) 

sex 

The minimization in ( [28| is a submodular function minimization (SFM) problem, since the majorizer M l + R is 
submodular, and thus can be done efficiently. 
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Proposition 2 (|43][) At each iteration, the new estimate x 2+1 produced by the Majorization-Minimization algorithm 
satisfies 

/(x i+1 ) + R( supp(x* +1 )) < /(x 3 4 ) + i?(supp(x*)), 

w/zzc/z implies convergence. 


Note that this scheme is a type of proximal gradient method (cf., Sect. 7.2.1) , applied to submodular reg- 
ularizers, where the proximal operation in ( |4l) corresponds to the SFM step in |28) . We refer the reader to 
Section [72] for more information. 


6.2 Convex approaches 

The non-convex problem ( |26] > can be converted to a closely related convex problem by replacing the discrete 
regularizer g(x) = ii(supp(x)) by its largest convex lower bound. This is also called convex envelope and is 
given by the biconjugate g **, i.e. by applying the Fenchel-Legendre conjugate twice [18]. We call this approach 
a convex relaxation of ( |26] >. However, for some functions g, there is no meaningful convex envelope. For these 
cases, one can use the convex closure R~ : [0, l] n —>► M, of i?(S), which is the point-wise largest lower bound 
of R pTf . Note that both the convex envelope and the convex closure are "tight" relaxations, but one in the 
continuous domain, the other in the discrete domain. 

Both notions of convex relaxation for submodular functions use the Lovasz extension (LE), introduced in 
(88) . We give here only one of the equivalent definitions: 

Definition 6.3 Given a set function R such that ii(0) = 0, we define its Lovasz extension as follows: \/x e R N , 

N 

r(x ) = (R({.iu ■ ■ ■ ,jk}) ~ R({ji, ■ ■ ■ , jfc-i})) (29) 

k=l 


where Xj 1 > 


> x Jn- 


The Lovasz extension is the convex closure of its corresponding submodular function on the unit hypercube, 
i.e R~ = r [411. In this sense, this extension already gives a convex relaxation for any submodular function. 
However, it turns out that using the Lovasz extension as a convexification might not always fully capture the 


structure of the discrete penalty. We elaborate more on this in Section 6.3 

For a monoton^] submodular function its convex envelope is also obtained through the LE. 


Proposition 3 ( ©> Given a monotone submodular function R, s.t ii(0) = 0, and its Lovasz extension r , the convex 
envelope of g(x) = R( supp(x)) on the unit loo-ball is given by the norm Q sub (x.) := g**(x) = r(|x|). 


The proximity operator |2) of Ll sub can be efficiently computed. In fact, computing prox^ sub (x) is shown 
in Q to be equivalent to minimizing the submodular function XR(S) — \ x i\ using the minimum-norm 

point algorithm. Other SFM algorithms can also be used, but then solving a sequence of SFMs is needed. Note 
that simpler subcases of submodular functions may admit more efficient proximity operator than this general 
one. 

As a result, the convex relaxation of ( |26| , where i?(supp(x)) is replaced by Ll sub (x), can then be efficiently solved 
by a proximal gradient method (cf.. Sect. |7.2.1| K 


The monotonicity requirement in |3| is necessary to guarantee the convexity o f r(| x|). Unfortunately, all 
symmetric^] submodular functions, among which undirected cut functions (cf.. Sect. 6.3 ), are not monotone. In 
fact, the convex envelope of R( supp(x)) on the unit ^oo-ball, for R any submodular function, is given by [2|: 


*(x) = min r(5). 

<5E[0,l] n ,<5>|x| 


(30) 


Thus, when R is non-decreasing, the convex envelope is r(|x|) as stated in Proposition^ When R is symmetric 
and R(fH) = ii(0) = 0 (which is assumed w.l.o.g since addition of constants doesn't affect the regularization), 
g** (x) = 0, Vx G M n and the minimum in ( [30] > is attained at any constant value vector S. 

So the convexification of symmetric submodular functions consists of the Lovasz extension alone. However 
the LE is a poor convexification that can significantly modify the intended structure. This can already be seen 
by the fact that the LE is tight only on the unit hypercube, while most penalties R( supp(x)) are not constrained 
there, and that i?(supp(x)) is symmetric around the origin which is not the case for r(x). Some artifacts of using 
the LE as a convexification for symmetric functions are illustrated in (43) . 

For some problems, the submodularity requirement can be relaxed. Convex relaxations of general set func¬ 
tions, when paired with the f^-norm (p > 1) as a continuous prior are studied in (l07) . 

3 A monotone function is a function that satisfies: VS C T C K, R(S) < R( T) 

4 A symmetric function is a function that satisfies: VS C H, R( S) = 7?(K \ S) 
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Proposition 4 (|107|) Define the norm 


(31) 


fi p(x) := min{ ^ R( S) i ||v s || p | ^ v s = x}, 

vGV sex sex 

where 1 + 1 = 1 and V = {v = (v s )§cx £ (M n ) 2 ^ s.t supp(v s ) C 8}. 

Q p is the convex positively homogeneous envelope of the function Ai?(supp(x)) + /x||x||g (up to a constant factor), and 
the convex envelope o/i?(supp(x))«||x|| p . 

Note that this is the same norm as with weights dj given by a set function R, such that dj = -R(Sj) 1 ^, and 
the group structure 0 is the entire power set of N. 

The convex relaxation proposed in Proposition |4j when applied to a submodular function, generalizes the 
relaxation of Proposition 15] to unit balls with respect to the ^ p -norm (p > 1) and not only to the ioo one. 


6.3 Examples 

We now offer some examples of structures that can be described via submodular functions. 




Figure 11: Wavelet coefficients and un- Figure 12: Background subtraction and 
derlying hierarchical group model. underlying Ising model. 


Simple sparsity: The most ubiquitous structure in practice is simple sparsity, which corresponds to the set 
function R(S) = |S| that measures the cardinality of S. The convexification obtained by Proposition [5] is the 
usual ^i-norm. 


Group sparsity: Given a group structure 0, a group sparse model can be enforced by penalizing the number 
of groups that intersects with the support: i? n (S) = SsnSi/0 ^ where di > 0 is the weight associated with 
the group Si E 0- For example, defining the groups to be each node and all of its descendants on a tree (see 


Figure 
(cf.. Sect 


enforces the support to form a rooted connected tree, which is characteristic of wavelet coefficients 
The way the groups and their weights are defined leads to different sparsity patterns ||3|. 


The convexification of i? n (S) is the ^i/^oo-norrr^jover groups X^ee^lWloo (even for overlapping groups) 
(2.140]. Note that for groups that form a partition of N, i? n (S) is equivalent to the minimum weight set cover, 
defined as: 

M M 

Rsc(&) = min y^diSi s.t > l n § (32) 

sGB m z z ' 

2=1 2 = 1 


with the same set of groups 0 and weights d. R sc computes the total weight of the lightest cover for S using 
the groups in 0. Note that R sc {S) is not a submodular functior^] unless the groups form a partition. 

Ising model: The Ising model (92) is a model that associates with each coefficient Xi of a signal x e M n a 
discrete variable s* e {—1,1} to represent the state (zero/non-zero) of the coefficient. The Ising penalty enforces 
clustering over the coefficients, which is a desired structure for example in background subtraction in images 
or videos (26) . It can be naturally encoded on a graph S = (V, £) where the vertices are the coefficients of the 
signal, i.e V = N, and the edges connects neighboring coefficients. For example, for images, the coefficients of 
the signal are the pixels of the image, and edges connect pixels next to each other, forming a lattice (see Figure 
[l2| . This is called the two dimensional lattice Ising model. The Ising penalty is then expressed via the following 
symmetric submodular function: 


-Rising (S) 



(33) 


5 Actually, it is a norm iff N = Ud->o Gi 

6 Consider this example: let N = {1, 2,3,4}, and 0 = {Si = {1}, S 2 = {2,3}, S 3 = {1, 2,4}}, with weights defined as di = |Si|- Then 
the inequality in definition | 6 . 2 | is not satisfied for the sets S = {1, 2 } for which R sc (S) = 3, and T = {1, 2 ,4} for which R sc ( T) = 4 ,with 
the addition of the element {3}. 
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where s G M n is an indicator vector for the set 8 such that Si = 1 if i E 8 and = —1, otherwise. When 
8 = supp(x) for a signal x e M n , s* encodes the state of the corresponding coefficient X{. We can also view 
the Ising penalty as a cut function: Rising(S) indeed just counts the number of edges that are cut by the set 8 . 
Cut functions with appropriate graphs and weights actually capture a large subset of submodular functions 
172,781. Since Rising is symmetric, its convexification is just its Lovasz extension, as discussed above, which is 
shown (43) to be the anisotropic discrete Total Variation semi-norm 

II x I|tV= ( 34 ) 

(i-j')ee 

While the Ising model enforces the clustering of non-zero coefficients, its convexification encodes a different 
structure by favoring piece-wise constant signals. Furthermore, this penalty cannot be described by any of the 
structures introduced in the previous section. 

Entropy: Given n random variables Xi, • • • , X n/ the joint entropy of X§ = (Xi)i e § defines a submodular 
function, R(§ ) = H(X§). Moreover, the mutual information also defines a symmetric submodular function, 
R(S) = I(Xs,Xn\s). Such functions occur, for example, in experimental design 11151, in learning of Bayesian 
networks |62|, in semi-supervised clustering |97j, and in diverse feature selection |36|. Both of these set func¬ 
tions encourage structures that cannot be defined by the models presented in the previous sections. 

Supermodular functions: Another variant of set function regularizers used in practice, for example in di¬ 
verse feature selection, are supermodular functions, i.e. the negative of a submodular function. The problem 
( |26| with supermodular regularizers can then be cast as a maximization problem with a submodular regular¬ 
izes The notion of "approximate" submodularity introduced in |37j79 1 is used to reduce the problem to an "ap¬ 
proximate" submodular maximization problem, where several efficient algorithms for constrained/unconstrained 
submodular maximization |21||22| with provable approximation guarantees can be employed. For more infor¬ 
mation, we refer the reader to 136]. 


7 Optimization 

In order to use the previous structures in practice, one needs efficient optimization solutions for structured 
sparsity problems that scale up in high-dimensional settings. From our discussions above, it is apparent that 
the key actors for this purpose are projection and proximity operations over structured sets that go beyond 
simple selection heuristics and towards provable solution quality as well as runtime/space bounds. 

Projection operations faithfully follow the underlying combinatorial model but, in most cases, result in 
hard-to-solve or even combinatorial optimization problems. Furthermore, model misspecification often results 
in wildly inaccurate solutions. 

Proximity operators of convex sparsity-inducing norms often can only partially describe the underlying 
discrete model and might lead to "rules-of-thumb" in problem solving (e.g., how to set up the regularization 
parameter). However, such approaches work quite well in practice and are more robust to deviations from the 
model, leading to satisfactory solutions. 

Here, our intention is to present an overview of the dominant approaches followed in practice. We consider 
the following three general optimization formulations 0 

• Discrete projection formulation: Given a signal model M k , let / : R n —> R be a closed convex data 
fidelity/loss function. Here, we focus on the projected non-convex minimization problem: 

minimize /(x) subject to x e M&. ( 35 ) 


• Convex proximity formulation: Given a signal model M k , let / : W 1 R be a closed convex data 
fidelity/loss function, g : M n — M a closed convex regularization term, possibly non-smooth, that faith¬ 
fully models M k and A > 0 . In this chapter, we focus on the convex composite minimization problem: 

min imize /(x)+A- 5 (x). ( 36 ) 

xGl n v ' 


• Convex structured-norm minimization: Given a signal model M k/ let g : W 1 M be a closed convex 
regularization term, possibly non-smooth, that faithfully models M k . Moreover, let / : M n -y R be a 
closed convex data fidelity/loss function and cr > 0 . We consider the following minimization problem: 

minimize g(x) subject to /(x) < cr. ( 37 ) 


7 We acknowledge that there are other criteria that can be considered in practice; for completeness, in the simple sparsity case, we refer 
the reader to the A-norm constrained linear regression (a.k.a. Lasso |123 |)—similarly, there are alternative optimization approaches for 
the discrete case |134| . However, our intention in this chapter is to use the most prevalent formulations used in practice. 
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In addition, we also briefly mention submodular function minimization (SFM) 

minimize R(S) 

where R : 2 ^ M is a submodular function; see Definition | 6 . 2 [ SFM is at the core of both discrete and convex 
approaches for solving estimation problems, where the structure is imposed via a submodular set function; see 
Section [ 6 ] 

Some representative problem instances are given next. 

Compressive sensing (CS): In classic CS, we are interested in recovering a high-dimensional signal x* e 
from an underdetermined set of linear observations u e M m (m <n), generated via a known <3> e M mxn : 

u = 4>x* + e. 


Here, e e M m denotes an additive noise term with ||e|| 2 < a. 

For /(x) := \ ||u — ^x||| and using the fact that x* e one can formulate the problem at hand as in (35); 
observe that, in the case where M k = E&, (35) corresponds to the pseudo-norm optimization problem 1981. 
Alternatively, given a function g that encourages solutions to be in M/c, one can solve the CS problem with 
convex machinery as in (36)-(37); when M k = Yi k , (36)-(37) corresponds to the Basis Pursuit DeNoising (BPDN) 
criterion (31). 

Sparse graph modeling: In Gaussian graph model selection [34 
maximum likelihood estimation yields the criterion: 

x = argmin j — logdet(mat(x)) + trace(mat(x) • C) + A • g(x) j, 

xGM ” 2 


39 64 65,82 83 127), the sparse-regularized 


where x e M n2 is the vectorized version of the inverse covariance estimate, mat(-) is the linear matricization 
operation and C is the sample covariance. Developments in random matrix theory [76| divulge the poor perfor¬ 
mance of solving (38) without regularization: in that case, the solution is usually fully dense and no inference 
about the dependence graph structure is possible. Choosing a suitable regularizer g [71 |85j87), that models the 
connections between the variables, reduces the degrees of freedom, leading to robust and more interpretable 
results. 


Other applications: Similar sparsity regularization in discrete or convex settings is found many diverse dis¬ 
ciplines: low-light imaging problem in signal processing [59], heteroschedastic LASSO |^5|, dictionary learn¬ 
ing, [ 122 ], etc. 


7.1 Greedy and discrete approaches 

7.1.1 Projected gradient descent and matching pursuit variants 

Iterative greedy algorithms maintain the combinatorial nature of (35) , but instead of tackling it directly, which 
would be intractable, the algorithms of this class greedily refine a /c-model-sparse solution using only ^local" 
information available at the current iteration. Within this context, matching pursuit approaches (91], 129j grad¬ 
ually construct the sparse estimate by greedily choosing the non-zero coefficients that best explain the current 
residual, e.g. y — Ax 2 , where x 2 is the current iterate. Extensions of such greedy approaches have been recently 
proposed to accommodate structured models in the selection process [67]. 

Most of the algorithmic solutions so far concentrate on the non-convex projected gradient descent algo¬ 
rithm: a popular method, known for its simplicity and ease of implementation. Per iteration, the total com¬ 
putational complexity is determined by the calculation of the gradient and the projection operation on M k 
as in (l). Most algorithms in the discrete model can be described or easily deduced by the following simple 
recursion: 

X i+1 = ?M k (x‘ - gv/^)) , (39) 


where g is a step size and (•) is the projection onto fc-model-sparse signals. 

Representative examples for the CS application are hard thresholding methods over simple sparse sets 
[14,27i|46j80 J84}99[. | 8 | further extends these ideas to model-based CS , where non-overlapping group struc¬ 
tures and tree structures are used to perform the projection ©• From a theoretical computer science perspec¬ 
tive, @[ 68 | address the CS problem using sparse sensing matrices. Exploiting model-based sparsity in recovery 
provably reduces the number of measurements m, without sacrificing accuracy. The resulting algorithm re¬ 
duces the main computational cost of the proposed scheme on the difficulty of projecting onto the model-sparse 
set, which, as mentioned in Section [3J in most relevant cases, can be computed in linear time using dynamic 
programming]^] 


8 In the case of CS, an important modification of (39) to achieve linear computational time per iteration is the substitution of the 
gradient with the median operator, which is nonlinear and defined component-wise on a vector; for more information, we refer to [ |46p2| . 
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7.1.2 Submodular function minimization 


Submodularity is considered the discrete equivalent of convexity in the sense that it allows efficient minimiza¬ 
tion. The best combinatorial algorithm for submodular function minimization (SFM) has a proven complexity 
of 0(n 5 T + n 6 ), where T is the function evaluation complexity 11091. For practical purposes however, another 
algorithm called minimum-norm point algorithm | [48) is usually used. This algorithm has no known worst 
case complexity but in practice it usually runs in 0(rr). Furthermore, for certain functions which are "graph 
representable" [49,72 J, submodular minimization becomes equivalent to computing the minimum s-t cut on 
the corresponding graph G(V, £), which has time complexit)mO(|£| min{|V| 2 / 3 , l^ 1 / 2 }) (54). 


7.2 Convex approaches 
7.2.1 Proximity methods 

Proximity gradient methods are iterative pr ocess es that rely on two key structural assumptions: i) / has Lip- 
schitz continuous gradienj^] (see Definition 2.1 ) and ii) the regularizing term g is endowed with a tractable 
proximity operator. 

By the Lipschitz gradient continuity and given a putative solution x 2 G dom(/ + g), one can locally approx¬ 
imate / around x 2 using a quadratic function as: 

/(x) < Q(x,x*) := f(x l ) + V/(x*) T (x-x*) + ^||x — x®|||, Vx e dom(F). 

The special structure of this upper-bound allows us to consider a majorization-minimization approach: instead 
of solving < [36] ) directly, we solve a sequence of simpler composite quadratic problems: 


j +1 


G argmin {Q(x, x 2 ) + #(x)} 


x€l n 


(40) 


In particular, we observe that ( |40) is equivalent to the following iterative proximity operation, similar to 

2 


i+1 • 1 

x ^ G argmin < - 

xGt n I ^ 


X 


X 




+ ^«(*) 


Here, the anchor point w in |2) is the gradient descent step: w := x 2 — k V/(x 2 


(41) 


In the case where g(x) := ||x||i, the proximity algorithm in ( [41) is known as the Iterative Soft-Thresholding 
Algorithm (ISTA) (28}|32[ 38|. Iterative algorithms can use memory to provide momentum in convergence. 
Based on Nesterov's optimal gradient methods [ 103 [, [ 101 proves the universality of such acceleration in the 
composite convex minimization case of ( |36) , where g(x) can be any convex norm with tractable proximity 
operator. However, the resulting optimization criterion in ( |4T) is more challenging when g stands for more 
elaborate sparsity structures. 

Within this context, [114 132 [ present a new convergence analysis for proximity (accelerated) gradient prob¬ 
lems, under the assumption of inexact proximity evaluations and study how these errors propagate into the con¬ 
vergence rate. 

An emerging direction for solving composite minimization problems of the form <[36) is based on the 


proximity-Newton method [50[. The origins of this method can be traced back to the work of [16,50], which 
relies on the concept of strong regularity introduced by [113] for generalized equations. In this case, we iden¬ 
tify that the basic optimization framework above can be easily adjusted to second-order Newton gradient and 
quasi-Newton approaches: 


c* +1 e argmin (L|x - (x® - H i \\ 2 + L(x) 

xGl n 1 Z 


(42) 


where represents either the actual Hessian of / at x 2 (i.e., V 2 /(x^)) or a symmetric positive definite matrix 
approximating V 2 /(x 2 ). Given a computationally efficient Newton direction, one can re-use the model-based 
proximity solutions presented in the previous subsection along with a second order variable metric gradient 
descent scheme, as presented in <[42) [ 128[. 


7.2.2 Primal-dual and alternating minimization approaches 

Several model-based problems can be solved using the convex structured-norm minimization in 
stylized example, consider the noiseless CS problem formulation 


minimize 

xGt n 


g(x) subject to u = <I>x, 


As a 


(43) 


9 the notation O(-) ignores log terms 

10 In 11281, the authors consider a more general class of functions with no global Lipschitz constant L over their domain. The description 
of this material is out of the scope of this chapter and is left to the reader who is interested in deeper convex analysis and optimization. 
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where g is a convex structured norm, mimicking the model at hand. 

Within this context, primal-dual convex optimization methods provide attractive approaches, by exploiting 
both the primal and the dual formulations at the same time (via Lagrangian dualization). More precisely, 
primal-dual methods solve the following minimax problem: 


min max jqr(x) + (<I>x — u, y)} . 

xGM n yGM m 1 V 7 X /J ' 


(44) 


where y is the Lagrange multiplier associated to the equality constraint u = 3>x. The minimax formulation ( 
can be considered as a special case of the general formulation in |29||104). 

If the function g is nonsmooth and possesses a closed form proximity operator, then general primal-dual 
methods such as [29,60[ have been known as powerful tools to tackle problem ( [44] ). Since g is nonsmooth, one 
can also use the primal-dual subgradient method [ 1061 as well as the prox-method proposed in [ 1021 for solving 
|43) . An alternative approach is to combine primal-dual optimization methods and smoothing technique as 
done in [104,105] for solving ( [44] ). 

When |44] > possesses a separable structure g(x) := X]f=i ^( x 0/ distributed approaches can be applied. 
Recently, the authors in [ 126 [ developed a unified primal-dual decomposition framework for solving ( [43] ), using 
also ideas from the alternating direction methods of multipliers [19,32 ,130 [. Such alternating optimization 
methods offer a unifying computational perspective on a broad set of large-scale convex estimation problems in 
machine learning [73] and signal processing |32) , including sparse recovery, deconvolution, and de-mixing (55) . 
They have numerous universality and scalability benefits and, in most cases, they are well-suited to distributed 
convex optimization. In this case, one can borrow ideas and tools from gradient descent and Newton schemes 
in solving the subproblems of the alternating minimization. While their theoretical convergence guarantees are 
not optimal, they usually work well in practice. 


8 Applications 


8.1 Compressive Imaging 

Natural images are usually sparse in wavelet basis. In this experiment, we study the image reconstruction 
problem from compressive measurements, where structured sparsity ideas are applied in practice. 

For this purpose and given a p x p natural grayscale image x G , we use the Discrete Wavelet Transform 
(DWT) with log 2 ( p ) levels, based on the Daubechies 4 wavelet, to represent x; see the Wavelet representation 
of two images in Figures |l3|[l4| In math terms, the DWT can be represented by an operator matrix W T , so 
that x can be sparsely represented (or well-approximated) as x = Wc, where c e M n , n := p 2 are the wavelet 
coefficients for x. 

To exploit this fact in practice, we consider the problem of recovering x e M n from m compressive mea¬ 
surements u G M m . The measurements are obtained by applying a sparse matrix A e R mx P to the vectorized 
image such that: 

u = Ax. 


Here, A is the adjacency matrix of an expander graph of degree d = 8, so that ||A||o = dn. Thus, the overall 
measurement operator on the wavelet coefficients is then given by the concatenation of the expander matrix 
with the DWT: u = AWc, with c « c with ||c||o <C m 2 , i.e., x can be well-approximated by using only a limited 
number of wavelet coefficients. 

We use the following methods for recovering c from the measurements u: 


minimize 

ceR n 

u — AWc ll 

subject to 

supp(c) e 7 k 

minimize 

ceR n 

llclli 

subject to 

u = AWc. 

minimize 

ceR n 

IIcIIhgl 

subject to 

u = AWc. 

minimize 

ceR n 

11 c 11 PC 

subject to 

u = AWc. 

minimize 

ceM n 

c FAM 

subject to 

u = AWc. 


(Rooted Connected Tree model (RC)) 


(Basis Pursuit (BP)) 


(Hierarchical Group Lasso (HGL) pursuit) 
(Parent-Child Latent Group Lasso (PC) pursuit) 


(Family Latent Group Lasso (FAM) pursuit) 
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The RC model is solved via the improved projected gradient descent given in 1801 with the projections 
computed via the dynamic program proposed in j5). All the remaining methods are solved using the primal- 
dual method described in 1 125 1 which relies on the proximity operator of the associated structure-sparsity 
inducing penalties. For BP the proximity operator is given by the standard soft-thresholding function. For 
HGL, we use the algorithm and code given by 1 75 . For the latent group lasso approaches, PC and FAM, we 
adopt the duplication strategy proposed in [ 69,108 , for which the proximity operator reduces to the standard 


block-wise soft-thresholding on the duplicated variables. All algorithms are written in Mat lab, except for the 
HGL proximity operator and the RC projection that are in C. 

The duplication approach consists in creating a latent vector that contains copies of the original variables. 
The number of copies is determined by the number of groups that a given variable belongs to. For the Parent- 
Child model, each node belongs to the four groups that contain each of its children and the group that contains 
its father. The root has only three children, corresponding to the roots of the horizontal, vertical and diagonal 
wavelet trees. The leaves belong only to the group that contains their fathers. A simple calculation shows that 
the latent vector for the PC model contains 2 (n — 1) variables. 

For the Family model, instead, each node belongs to only two groups: the group containing its children and 
the group containing its siblings and its father. Each leaf belongs to only the group that contains its siblings 
and its father. Overall, the number of variables in the latent vector for the Family models is equal to ^ — 1. 

The duplication approach does not significantly increase the problem size, while it allows an efficient imple¬ 
mentation of the proximity operator. Indeed, given that the proximity operator can be computed in closed form 
over the duplicated variables, this approach is as fast as the hierarchical group lasso one, where the proximity 
operator is computed via C code. 

In order to obtain a good performance, both the parent-child and the family model require a proper weight¬ 
ing scheme to penalize groups lower down in the tree, where smaller wavelet coefficients are expected, com¬ 
pared to nodes closer to the root, which normally carry most of the energy of the signals and should be penal¬ 
ized less. We have observed that setting the group weights proportional to the level L of the node of the group 
closest to the root gives good results. In particular, we set the weights equal to L 2 , with 0 being the root level. 


8.1.1 Results 


We performed the compressive imaging experiments on both a 256 x 256 portrait of a woman and a 2048 x 2048 
mountain landscape. Apart from conversion to grayscale and resizing through the Matlab function imresize, 
no preprocessing has been carried out. The primal-dual algorithm of [ 125 [ has been run up to precision 10 -5 . 
We measure the recovery performance in terms of Power Signal to Noise Ratio (PSNR) and relative recovery 
error i 2 norm as ^ x .r^ 2 , where x is the estimated image and x is the true image. 


Figures 13|14 report the recovery results, using m = |, that is using only 12.4% samples compared to the 
ambient dimension. The estimated images are on the top two rows, while the third and fourth rows show the 
estimated wavelet coefficients. 

The effect of imposing structured sparsity can be clearly seen for the HGL, PC and FAM models, where 
the high values of the coefficients tend to cluster around the root of the wavelet tree (i.e., top-left corner of the 
image) and their intensity decreases descending the tree. The family model shows the grouping among the 
siblings, where four leaves are either all zero or all non-zero. For the 256 x 256 image, despite being coded in 
C, the discrete model is approximately 160 times slower than the other methods, which are computationally 
equivalent: e.g., in our tests, the family model took around 60 seconds, while the RC one required almost 2 
hours. We therefore did not use the RC model on the larger 2048 x 2048 mountain image, but we compared 
also against Total Variation (TV) pursuit, which obtains the best performance on this image. 


8.2 Neuronal spike detection from compressed data 

In the experiments that follow, we compare the performance of the following three optimization criteria, as- 


suming the dispersive model 


minimize 

xGl n 

/(x) subject to x e Dfc. 

(Discrete dispersive) 

minimize 

xGR™ 

/( x ) T ^ ‘ ^exclusive ( x )- 

(Exclusive norm regularization) 

minimize 

xGM n 

^exclusive ( x ) Subject to /( x ) ^ & • 

(Exclusive norm pursuit) 


Empirical performance on synthetic data: Figures [L5][T6| illustrate the utility of each approach in the com¬ 
pressed sensing setting where /(x) := ^ ||u — ^E»x|| That is, we observe x* e M n through a limited set of linear 
sketches u = 3>x* + e e M m where <I> e M mxn is a known linear sketch matrix. Here, we assume n = 500 and 
m = 70 for ||x *|| 0 = 25. Without loss of generality, we assume (x*) • > 0, Vi and A* = 20. 

In the discrete case, we relax the refractory period A to model signal structure deviations; here, we assume 
A = 15. The discrete exclusive model |8}[99jl clearly outperforms the rest of the approaches under comparison; 
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ORIGINAL BP HGL 



PSNR 26.224 PSNR 25.543 PSNR 17.636 

ORIGINAL BP HGL 



Error 0.090 Error 0.097 Error 0.242 


Figure 13: Woman image recovery performance from compressive measurements. Here, 
p = 256. The top two rows show the reconstruction performance in the original domain, 
along with the PSNR levels achieved. The bottom two rows show the corresponding rep¬ 
resentations into the wavelet domain. 
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ORIGINAL 



FAM PC TV 



PSNR 25.945 



PSNR 26.698 



PSNR 28.324 


ORIGINAL BP 



Error 0.118 


HGL 


Error 0.080 


FAM 



Error 0.081 


PC 



Error 0.088 


TV 


Error 0.067 


Figure 14: Mountains image recovery performance from compressive measurements. 
Here, p = 2048. The top two rows show the reconstruction performance in the original do¬ 
main, along with the PSNR levels achieved. The bottom two rows show the corresponding 
representations into the wavelet domain. 
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Figure 15: Performance of the discrete dispersive approach for the problem spike train 
recovery from a limited set of linear measurements. 


such behavior is also observed on average over the set of experiments conducted (Figure [15|. This also implies 
that the discrete model usually requires fewer measurements for accurate recovery compared to conventional 
sparse approximation, as long as the underlying signal approximately follows T) k . 

On the other hand, due to convex relaxations, convex approaches introduce unnecessary nonzero coeffi¬ 
cients that do not comply with the underlying model. However, both approaches show good performance in 
recovering x* from limited measurements; see Figure |16| 

Real neuronal spike data: In order to understand the functioning of the human brain, it is necessary to identify 
and study the behavior of neuronal cell membranes under rapid change in the electric potential. However, to 
observe such phenomena, electrical activities on neurons need to be recorded using specialized equipment. In 
this experiment, we perform somatic spike detection of a tufted L5 pyramidal cell responding to in-vivo-like 
current injected in the apical dendrites and the soma simultaneously (see [45] for the experimental details). 

A snapshot of the neuronal spikes waveforms is shown in Figure [17] In order to accurately detect the 
neuronal spikes, a high-frequency sample acquisition equipment is required. Within this context, we apply CS 
ideas to decrease the number of samples needed to a ppro ximately detect the positions of the spike train. Let 
x* G M n with n = 832 represent the signal in Figure ll 7a j furthermore, let <I> G M mxn be the sensing matrix 
where m = 0.25 • n, i.e., we perform a 75% compression. 


Error: ||x — x*|| 2 = 0.30667 


Error: llx- x*|| 2 = 0.30118 
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(b) Exclusive norm pursuit 


Figure 16: Performance of dispersive convex relaxations for the problem of spike train 
recovery from a limited set of linear measurements. (Left panel:) Exclusive norm regular¬ 
ization approach. (Right panel:) Exclusive norm pursuit approach. 


We use the proposed schemes to recover the locations of the neuronal spikes under the assumption of the 
dispersive model with refractory period A. Here, A is set equal to the average period between two consecutive 
spikes. 

Figure |l7b| represents the recovered k -sparse approximation using the discrete dispersive model T) k . Here, k 
is set to the number of spikes expected to appear for a given time period—such number can be easily deduced 
by observing the behavior of a specific neuron type. From Figure |17b| we observe that the discrete model 
approximates the locations of the spikes quite accurately: most of the spike locations are exactly recovered. 
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However, due to the "strictness" of the discrete model, we observe that small deviations from lead to 
imprecise estimations; e.g., between the 12 th and 13 th spike of the sequence, a larger (than usual) refractory 
period is observed that leads to mis-location of the next spike estimation. 

Figures |l7c]|17d| depict the performance of convex solvers using the exclusive norm as (i) regularizer and (ii) 
objective function. Tweaking the A parameter in the (i) case, one can achieve sparse solutions that approximate 
the underlying model (Figure |l7c) ; however, one can observe multiple detected spikes with separation less 
than A, violating the assumed model. In the model-based Basis pursuit case, the solver tries to fit the solution 
to the data, which usually leads to less sparse solutions (Figure |l7d| ). One can further sparsity the convex 
solutions to obtain a k -sparse answer as in Figures 17e||17f| however, in most cases, further processing of the 
returned signal is required to maintain a O/c-modeled solution. E.g., in this case, due to the fact that convex 
norms force the solution to fully explain the observations, the sparsified solution includes more than one spike 
per true spike location. 
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(a) Original neuronal spike train signal. 


200 400 600 800 

Time index 

(b) Spike detection using discrete model. 
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(c) Spike detection using the exclusive (d) Spike detection using the Model- 
norm-regularized convex approach. based Basis pursuit. 




200 400 600 800 

Time index 


200 400 600 800 

Time index 


(e) k -sparse approximation of the exclu- (f) k -sparse approximation of the model- 
sive norm-regularized convex solution. s based Basis pursuit solution. 


Figure 17: Spike detection in real neuronal data using the dispersive model. Figure [l7a| 
depicts the original signal x*. We observe u = <l>x* using only 25% measurements through 
a linear sketch <l>. Figures 1 7b|l 7d| illustrate the performance of the three approaches under 
comparison. Figures 17e||17f show the convex solutions, sparsified to be k- sparse. 
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8.3 Digital Confocal Imaging 


Confocal imaging [95,117] has become one of the best techniques for 3D imaging, due to its ability to reduce 
the blur caused by out-of-focus scattering by means of focus light and a physical pinhole. The recent work of 
Goy et al. [561 combines this technique with digital holography which allows access to the complex field at the 
recording CCD sensor and enables a novel and flexible way of eliminating the out-of-focus contribution via a 
so-called "virtual pinhole" in the digital domain. Under the assumption that the scattered light is not distorted 
by the matter in the light cones around the focus, this makes it possible to obtain very sharp images of the 
sample at all depths. 

We are interested in modeling the scattering process so that we can instead leverage the information car¬ 
ried by the out-of-focus scattering for reducing the number of focusing locations that are needed for a faithful 
imaging of the sample. We create a linear model by adopting the classical first Born approximation for the 
scattering [17[. Our model takes into account the specific form of the Gaussian beam used to focus the light in¬ 
side the sample, the numerical aperture of the transmission lens that determines which fraction of the scattered 
light is captured and the resolution and size of the CCD that records the hologram. For simplicity, we model 
the scattered field as it would arrive on the CCD and not its interaction with the reference field normally used 
to obtain the hologram. We use a 3D discrete model for representing the imaged sample, where one 3D pixel, 
or voxel, contains the average value of the scattering potential in the volume contained therein. 

For a specific focus location i, the measurement model A$ maps a 3D object represented by its scattering 
potential F to a 2D complex field S. The complete measurement model A is then given by the concatenation of 
A* for all focus points 

Ai(F) 

A 2 (F) 


A(F) = 


A P (F) 


where P is the total number of focus points. 

For the experiments in this section, we consider a 64 x 8 x 64 volume filled with oil with refractive index 
1.52, containing two concentric cylinders with axis aligned with the y axis. Their refractive indexes are 1.54 for 
the outer one and 1.56 for the inner one. A x-z cross section of their scattering potential is shown in Figure |l8| 
(Left). Note that F is equal to zero outside the cylinder and has constant value inside each. These properties 
could be captured via sparsity and clusteredness, for example via norm or Total Variation minimization or 
via the Ising discrete model; see Section [6] We choose 16 focus locations taken from a uniform grid over the 
central x-z plane. The focusing lens and the transmission lens have numerical aperture 1 and 0.65 respectively, 
while the coherent light of the Gaussian beam has wavelength A = 405 • 10 -9 . We consider a 91 x 91 CCD 
that covers the entire field of view of the transmission lens. We simulate the measurements by applying the 
measurement model to the synthetic scattering potential F, that is u = A(F), where we have F e M 64x8x64 and 

u G C 16x91x91_ 

Despite the apparent oversampling ratio, the problem is still ill-posed, because each recorded 
field carries information only about a small neighborhood around each focus point. 

We compare four methods for estimating F from the measurements u. Let the data-fit term be the square 
loss L(F) = H|u-A(F)|||. 


minimize 

F 

minimize 

F 

minimize 

F 


L( F). 

||F||i subject to 
||F|| T y subject to 


u = A(F) . 
u = A(F) . 


(Conjugate Gradient (CG)) 
(Basis Pursuit (BP)) 
(Total Variation (TV) pursuit) 


minimize L( F) + Ai?i S i NG (supp(F)) + r| supp(F)| , (Ising plus Cardinality (IC)) 

In the last approach, one has two regularization parameters A, r > 0 to be set. 


8.3.1 Results 

We evaluate the recovery performance of the methods using the relative recovery error ||F — F*|| 2 /1|F*|| 2 , 
where F* is the synthetic scattering potential used to simulate the output y. We explored several pairs of 
regularization parameters A and r for the discrete IC model, selecting the pairs that gave the best performance. 
Of course, in practice one cannot resort to this type of parameter exploration and needs to chose them according 
to knowledge regarding the noise level and the desired amount of clusteredness. Furthermore, this method is 
very sensitive to changes in the regularization parameters, making it difficult to find the best value for them. 

Figure [18] (right) reports the results on this experimental setup. Total Variation pursuit achieves the best per¬ 
formance without having to tune any parameter, closely followed by the discrete model. Despite the scattering 
potential is also sparse. Basis Pursuit enforces too much sparsity, especially in the regions of the sample that lie 
farthest away from the focus points and hence do not contribute to the the output u. It is interesting to note, 
that in this case, enforcing the incorrect structure leads to poorer performance than unstructured recovery. 
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Error 0.41 Error 0.43 


Figure 18: (Left) Central section of the sample to be reconstructed. The black dots represent 
the 16 focus locations. (Right) Reconstruction results for the four methods. 
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